Issue/253: feat: support offline int8 kv cache quantization#254

Open
qinyiqun wants to merge 4 commits into main from Issue/253

Conversation


@qinyiqun qinyiqun commented Mar 4, 2026

Support offline int8 kv cache quantization for static kv cache
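Offline int8 kv cache quantization generally means keys and values are stored as int8 with scales calibrated ahead of time, and dequantized on read. A minimal NumPy sketch of the idea (illustrative only, not this PR's actual kernels; the function names are hypothetical):

```python
import numpy as np

def quantize_kv_int8(x: np.ndarray, scale: float) -> np.ndarray:
    # symmetric per-tensor int8 quantization with an offline-calibrated scale
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize_kv_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # recover an approximation of the original values at read time
    return q.astype(np.float32) * scale

# "offline" calibration: the scale comes from a range observed on a
# calibration set; here we just use the tensor's own max for illustration
x = np.array([0.5, -1.0, 2.0], dtype=np.float32)
scale = float(np.abs(x).max()) / 127.0
q = quantize_kv_int8(x, scale)
x_hat = dequantize_kv_int8(q, scale)
# the round trip is lossy, but the error stays within one quantization step
assert np.all(np.abs(x - x_hat) <= scale)
```

The cache then holds `q` plus the scalar `scale` instead of the fp16/fp32 tensor, roughly halving (vs fp16) or quartering (vs fp32) kv cache memory.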

@qinyiqun qinyiqun requested review from a team and wooway777 March 4, 2026 07:20
@qinyiqun qinyiqun changed the title from "Issue/253: feat: support custom KV cache dtype for quantization" to "Issue/253: feat: support offline int8 kv cache quantization" Mar 18, 2026
-def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0):
-    _infinilm.StaticKVCacheConfig.__init__(self, max_batch_size, max_cache_len)
+def __init__(self, max_batch_size: int = 1, max_cache_len: int = 0, kv_cache_dtype: str | None = None):
@PanZezhong1725 PanZezhong1725 (Collaborator) Mar 19, 2026

It would be better to additionally provide an interface that accepts the framework's internal dtype, and to provide a parse mapping in Python, so that Python users can open the Python file and see what each string means.

@qinyiqun (Contributor, Author)

[image attached] In vLLM, strings are used everywhere up to the flash attention call, so I don't think using strings is a problem, as long as they are documented.

…quant.cpp; (2)update kv_cache_dtype handling; (3)Update Python test scripts
@PanZezhong1725 PanZezhong1725 force-pushed the Issue/253 branch 2 times, most recently from 203620f to ae78252 Compare March 20, 2026 08:42